GH-15855: core pipeline API #16039

sebhrusen · 2024-01-29T15:53:06Z

implements core API for #15855

sebhrusen · 2024-01-29T17:22:55Z

h2o-core/src/main/java/water/util/Checksum.java

+   * @param ignoredFields A {@link Set} of fields to ignore. Can be empty or null.
+   * @return checksum A 64-bit long representing the checksum of the object
+   */
+  public static <T> long checksum(final T obj, final Set<String> ignoredFields, final long initVal) {


logic extracted for Model.Parameters so that it can also be used in pipeline DataTransformers
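For readers unfamiliar with this kind of helper, here is a minimal sketch of a reflection-based checksum that skips a set of ignored fields. It only mirrors the signature shown in the diff above. The class `ChecksumSketch`, the `Params` holder and the hashing scheme are assumptions made for illustration, not the actual `water.util.Checksum` code.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch: NOT the actual water.util.Checksum implementation.
class ChecksumSketch {

  // Hypothetical parameter holder used only for this illustration.
  static class Params {
    String _algo = "glm";
    int _nfolds = 5;
    double _alpha = 0.5;
    String _modelId = "model_1"; // an identifier that should not influence the checksum
  }

  // Same shape as the signature in the diff: ignoredFields may be empty or null.
  static <T> long checksum(final T obj, final Set<String> ignoredFields, final long initVal) {
    long xs = initVal;
    Field[] fields = obj.getClass().getDeclaredFields();
    // getDeclaredFields() guarantees no particular order: sort by name for a stable result.
    Arrays.sort(fields, Comparator.comparing(Field::getName));
    for (Field f : fields) {
      if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
      if (ignoredFields != null && ignoredFields.contains(f.getName())) continue;
      f.setAccessible(true);
      try {
        // Mix in both the field name and its value.
        // Note: array values are hashed by identity here, which a real implementation would need to handle.
        xs = 31 * xs + f.getName().hashCode();
        xs = 31 * xs + Objects.hashCode(f.get(obj));
      } catch (IllegalAccessException e) {
        throw new IllegalStateException(e);
      }
    }
    return xs;
  }
}
```

With this sketch, two `Params` instances that differ only in `_modelId` produce the same checksum when `"_modelId"` is passed in `ignoredFields`, and different checksums when nothing is ignored.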

sebhrusen · 2024-01-29T17:32:00Z

h2o-core/src/main/java/hex/ModelBuilder.java

-    private ModelBuilderListener _callback;
-
-    public void setCallback(ModelBuilderListener callback) {
-      this._callback = callback;


callbacks have been moved from Driver to ModelBuilder itself as it allows much better configurability needed by the pipeline

sebhrusen · 2024-02-02T17:18:54Z

h2o-admissibleml/src/main/java/hex/Infogram/Infogram.java

@@ -71,18 +74,23 @@ protected int nModelsInParallel(int folds) {
   * This is called before cross-validation is carried out
   */
  @Override
-  public void computeCrossValidation() {
+  protected void cv_init() {


Changes in the algos are mainly due to the CV API now exposing only protected hooks at various points of the model-building cycle. Overriding more than those hooks breaks the pipeline logic, which needs very strict behaviour when building CV models (especially as it needs full control over the frames being used at that time).
Algos are therefore encouraged to override only those small hooks, and the ModelBuilder itself remains algo-agnostic.
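The hook-based design described above is essentially the template method pattern. The sketch below illustrates it under stated assumptions: only `cv_init()` and `computeCrossValidation()` appear in the diff, while `BuilderSketch`, `InfogramSketch`, the `cv_done()` hook, the `final` modifier and the fold loop are hypothetical and do not reflect the real `ModelBuilder`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "small protected hooks" idea, not the real hex.ModelBuilder.
class CvHooksSketch {

  abstract static class BuilderSketch {
    final List<String> log = new ArrayList<>();

    // The CV cycle itself is fixed (assumed final here) so that the order of steps
    // and the frames in use stay under the builder's control.
    public final void computeCrossValidation() {
      cv_init();
      for (int fold = 0; fold < 2; fold++) {
        log.add("build fold " + fold);
      }
      cv_done();
    }

    // Hook named in the diff: called before cross-validation is carried out.
    protected void cv_init() {}

    // Hypothetical second hook, added only to show that several hooks can exist.
    protected void cv_done() {}
  }

  // An algo customizes behaviour by overriding hooks only.
  static class InfogramSketch extends BuilderSketch {
    @Override
    protected void cv_init() {
      log.add("infogram init");
    }
  }
}
```

Calling `computeCrossValidation()` on an `InfogramSketch` runs the algo-specific initialization first and then the shared fold-building steps, without the algo ever touching the loop itself.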

wendycwong · 2024-02-05T23:21:32Z

h2o-core/src/test/java/hex/pipeline/PipelineTest.java

+      assertNotNull(k.get());
+      assertVecEquals(fr.vec(i), k.get(), 1e-10);
+    }
+  }


Can you add an example with a pipeline that, say, multiplies a column by 2 and then builds a GLM, GBM or DRF model?

@wendycwong added a quick MultiplyNumericColumnTransformer for the test below, with assertions verifying that it's applied correctly and that the original frame is not modified.
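To make the two assertions mentioned above concrete, here is a rough stand-alone sketch of a multiplying transformer that leaves its input untouched. It does not use the H2O `Frame` API: a frame is modelled as a plain map from column name to values, and `MultiplyTransformerSketch` is a hypothetical name, not the `MultiplyNumericColumnTransformer` from `PipelineTest`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a "frame" is just column name -> values, not water.fvec.Frame.
class MultiplyTransformerSketch {

  // Returns a new frame in which `column` is multiplied by `factor`.
  // The input frame and its arrays are never modified.
  static Map<String, double[]> multiply(Map<String, double[]> frame, String column, double factor) {
    double[] src = frame.get(column);
    if (src == null) throw new IllegalArgumentException("unknown column: " + column);
    // Shallow copy: untouched columns are shared, the transformed one is replaced.
    Map<String, double[]> out = new LinkedHashMap<>(frame);
    double[] dst = new double[src.length];
    for (int i = 0; i < src.length; i++) {
      dst[i] = src[i] * factor;
    }
    out.put(column, dst);
    return out;
  }
}
```

A test in the spirit of the one described would check both that the returned column holds the doubled values and that the original column still holds the initial ones.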

wendycwong · 2024-02-06T16:40:19Z

@sebhrusen : Is it possible for a user to add her own munging pipeline? For example, I want to transform a column by subtracting the column mean?

sebhrusen · 2024-02-06T17:20:50Z

> Is it possible for a user to add her own munging pipeline? For example, I want to transform a column by subtracting the column mean?

@wendycwong no, this is not currently supported:

The current implementation focuses on the backend: the client support is only there to manipulate (and especially to predict with) pipeline models that have been built by AutoML, for example.
There are many ways to extend the pipeline logic and make it more customizable:
1. in AutoML, the preprocessing param can be used to select various predefined transformers that will make up the training pipeline.
2. for single models and grids, the pipeline client can later be extended to allow users to define their own pipeline (for example using a syntax similar to sklearn Pipeline).
3. for even more customization, like the scenario you're suggesting, we could allow code ingestion, probably as Jython scripts: it should not be too difficult to implement a JythonDataTransformer.
Finally, let's keep in mind that MOJO is not supported yet either, although it is likely to be much easier to support with this Pipeline mechanism than with the legacy target encoding (TE) support embedded in Model/ModelBuilder. Every transformation now clearly applies sequentially and the estimator model (e.g. GLM) contains only post-transformation information, whereas with the legacy TE integration the model contained a mix of pre-encoding and post-encoding information, which made the MOJO extremely difficult to implement because other categorical encodings were mixed in with it.
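As an illustration of the mean-subtraction scenario discussed above, a custom transformer would need a fit step (learn the mean on the training data) and a transform step (apply that same mean to any later data). The sketch below is purely hypothetical: `MeanCenteringSketch` and its `fit`/`transform` methods are not part of the pipeline API in this PR and work on plain arrays rather than H2O frames.

```java
import java.util.Arrays;

// Hypothetical sketch of a user-defined transformer; not an H2O DataTransformer.
class MeanCenteringSketch {
  private double mean;
  private boolean fitted = false;

  // Learns the column mean from the training data.
  MeanCenteringSketch fit(double[] trainingColumn) {
    mean = Arrays.stream(trainingColumn).average()
        .orElseThrow(() -> new IllegalArgumentException("empty column"));
    fitted = true;
    return this;
  }

  // Applies the mean learned in fit() and returns a new array; the input is not modified.
  double[] transform(double[] column) {
    if (!fitted) throw new IllegalStateException("fit() must be called before transform()");
    return Arrays.stream(column).map(v -> v - mean).toArray();
  }
}
```

Because the estimator would only ever see the output of `transform`, it holds post-transformation information only, which is the property the comment above credits for making MOJO support easier.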

I'm answering more than your simple question here, but wanted to explain the scope of this, so maybe I will copy this answer to the original issue for reference.

tomasfryda

I have just some minor comments, otherwise it looks great! Thank you @sebhrusen for not just hacking it together (as I probably would 😅 ) and inventing those nice abstractions!

tomasfryda · 2024-02-09T13:09:08Z

h2o-bindings/bin/custom/R/gen_pipeline.py

+    model$estimator_model <- NULL
+  }
+  model$transformers <- unlist(lapply(model$transformers, function(dt) new("H2ODataTransformer", id=dt$id, description=dt$description)))
+  # class(model) <- "H2OPipeline"


Why is this commented? If needed, you can assign multiple classes, e.g., class(model) <- c("H2OPipeline", "H2OModel").

Good point! Right, I forgot about multiple inheritance. I think I commented it out because it didn't seem necessary (single algos don't have a dedicated class) and it broke some behaviour somewhere (I can't remember what exactly).
The funny part is that the class is defined as follows:

setClass("H2OPipeline", contains="H2OModel",

but afair, it was still not recognized as a model somewhere…
I will give your suggestion a try and see if it breaks some R tests.


tomasfryda · 2024-02-09T13:43:54Z

h2o-core/src/main/java/water/KeyGen.java

+
+    private enum Command {
+      SUBSTITUTE() {
+        private final Pattern CMD = Pattern.compile("s/(.*?)/(.*?)/?");


For other reviewers:
It took me a couple of minutes to realize that this is not a PCRE that substitutes a non-greedy match with a verbatim (.*?). The whole expression is actually the matching part (not a substitution). The expected string is s/something/something or s/something/something/.

Please correct me if I'm wrong @sebhrusen .
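For other readers, the behaviour described above can be checked with a small stand-alone snippet. The regular expression is the one from the diff. The class `SubstituteCommandSketch`, the `parse` helper and the use of `Matcher.matches()` (full-string match) are assumptions for illustration, not the actual `KeyGen` code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing what the CMD pattern from the diff accepts.
class SubstituteCommandSketch {
  // Same regex as in the diff: two reluctant groups and an optional trailing slash.
  private static final Pattern CMD = Pattern.compile("s/(.*?)/(.*?)/?");

  // Returns {pattern, replacement}, or null if the string is not a substitute command.
  // Assumes the whole string must match, hence matches() rather than find().
  static String[] parse(String command) {
    Matcher m = CMD.matcher(command);
    if (!m.matches()) return null;
    return new String[] { m.group(1), m.group(2) };
  }
}
```

Under these assumptions, `s/foo/bar` and `s/foo/bar/` both parse to the pair `foo`, `bar`, and a command with an empty replacement such as `s/prefix_//` parses to `prefix_` and the empty string.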

Correct. The idea behind those KeyGens is to better understand how all those keys are created, and to ensure consistency when some keys are created based on other ones. A pipeline can use a lot of keys: for all the temp frames, for models used in data transformers, and so on.

An example of this substitution pattern is used in the pipeline integration in AutoML:

// in ModelingStep.applyPipeline(...)
pparams._estimatorKeyGen = hyperParams == null
    ? new ConstantKeyGen(resultKey)
    // in case of grid, remove the Pipeline prefix to obtain the estimator key; this allows naming compatibility with the classic mode.
    : new PatternKeyGen("{0}|s/"+PIPELINE_KEY_PREFIX+"//");

this way the key formatting doesn't have to leak later in the code (in this case in Grid) that should be pipeline-agnostic
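To give a feel for how a pattern such as `{0}|s/<prefix>//` could turn a pipeline key into an estimator key, here is a hedged sketch. Everything in it is assumed: that `{0}` stands for the source key, that `|` chains commands, that the substitution is literal and applies to the first occurrence only, and that the prefix looks like `Pipeline_`. The real `PatternKeyGen` and `PIPELINE_KEY_PREFIX` may differ on all of these points.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the semantics of the real PatternKeyGen are assumed, not known.
class PatternKeyGenSketch {
  private static final Pattern SUBSTITUTE = Pattern.compile("s/(.*?)/(.*?)/?");

  static String makeKey(String pattern, String sourceKey) {
    // Assumption: the first segment is a template, the following ones are commands.
    String[] parts = pattern.split("\\|");
    String key = parts[0].replace("{0}", sourceKey);
    for (int i = 1; i < parts.length; i++) {
      Matcher m = SUBSTITUTE.matcher(parts[i]);
      if (!m.matches()) throw new IllegalArgumentException("unsupported command: " + parts[i]);
      // Assumption: literal replacement of the first occurrence only.
      key = key.replaceFirst(Pattern.quote(m.group(1)), Matcher.quoteReplacement(m.group(2)));
    }
    return key;
  }
}
```

With a hypothetical prefix `Pipeline_`, `makeKey("{0}|s/Pipeline_//", "Pipeline_GBM_grid_1_model_3")` yields `GBM_grid_1_model_3`, so code downstream never needs to know about the prefix.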

tomasfryda · 2024-02-09T13:45:21Z

h2o-core/src/main/resources/META-INF/services/water.AbstractH2OExtension

@@ -0,0 +1 @@
+#hex.pipeline.PipelineRegistration


Intentional # prefix?

oh, right, I should remove this

tomasfryda · 2024-02-09T14:40:40Z

h2o-core/src/main/java/hex/pipeline/transformers/FilteringTransformer.java

+import water.fvec.Frame;
+
+/**
+ * WiP: not used for now.


Could you specify what remains to be done here? (if anything)

it's mainly a high level abstraction for other transformers that would for example remove rows (not) containing a specific value or that have too few values…
As we don't use this in AutoML (yet), I didn't implement any specific transformer, and so this abstraction is probably missing useful stuff…
Basically, it's just an idea for now. I can delete it if you want.



* Revert "GH-15857: cleanup legacy TE integration in ModelBuilder and AutoML (#16061)"
  This reverts commit a8f309b.
* Revert "GH-15857: AutoML pipeline support (#16041)"
  This reverts commit 17fa9ee.
* Revert "GH-15856: Grid pipeline support (#16040)"
  This reverts commit b7ac670.
* Revert "GH-15855: core pipeline API (#16039)"
  This reverts commit c15ea1e.

sebhrusen added the core label Jan 29, 2024

sebhrusen added this to the 3.46.0.1 milestone Jan 29, 2024

sebhrusen self-assigned this Jan 29, 2024

sebhrusen linked an issue Jan 29, 2024 that may be closed by this pull request

AutoML Pipeline – Java API #15855

Closed

sebhrusen commented Jan 29, 2024


sebhrusen force-pushed the seb/gh-15855 branch from 676ad67 to 458f902 Compare February 2, 2024 16:25

sebhrusen commented Feb 2, 2024


sebhrusen force-pushed the seb/gh-15855 branch from 16b2444 to 009c161 Compare February 5, 2024 15:29

wendycwong suggested changes Feb 5, 2024


wendycwong requested review from mn-mikke, tomasfryda, maurever, valenad1 and syzonyuliia February 5, 2024 23:27

tomasfryda previously approved these changes Feb 9, 2024


sebhrusen dismissed tomasfryda’s stale review via c7c5bb4 February 9, 2024 22:15

sebhrusen and others added 11 commits February 11, 2024 17:00

core pipeline API

c73bdc0

remove unnecessary explicit casting

0f2664b

remove grid some integration logic (moving to dedicated PR)

1c5239a

fix ref comparison in PipelineHelperTest

cd15cc9

remove Pipeline from sklearn estimators support

0689c31

fix dynamic test for py pipeline algo

9ca6a30

fix R CRAN check

bd48176

revert changes on Model.scoreMetrics, but extracted suspicious code

196299c

added example of multiplier transformer in PipelineTest to show how c…

1f0dd43

…olumns transformations can be easily implemented and applied declaratively

Apply suggestions from @TomF 's code review

78db7a6

Co-authored-by: Tomáš Frýda <[email protected]>

addressed tomf suggestions

f3d717d

sebhrusen force-pushed the seb/gh-15855 branch from aac3156 to f3d717d Compare February 11, 2024 16:00

wendycwong approved these changes Feb 11, 2024


tomasfryda approved these changes Feb 12, 2024


sebhrusen merged commit c15ea1e into master Feb 12, 2024
62 of 68 checks passed

sebhrusen deleted the seb/gh-15855 branch February 12, 2024 13:02

sebhrusen mentioned this pull request Feb 12, 2024

GH-15856: Grid pipeline support #16040

Merged

mn-mikke added a commit that referenced this pull request Feb 27, 2024

Revert "GH-15855: core pipeline API (#16039)"

fa54892

This reverts commit c15ea1e

valenad1 added a commit that referenced this pull request Mar 8, 2024

Revert "GH-15855: core pipeline API (#16039)"

224c5df

This reverts commit c15ea1e.
